Task introduction
Text-to-video retrieval aims to automatically locate, in a large-scale video database, the video segments that are most semantically relevant to a given natural-language description.
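Retrieval quality on datasets like those below is typically reported as Recall@K: the fraction of text queries whose ground-truth video appears among the K most similar videos. A minimal sketch, assuming text and video embeddings have already been computed by some model and that query i matches video i:

```python
import numpy as np

def recall_at_k(text_emb, video_emb, k=1):
    """Fraction of text queries whose ground-truth video (same row index)
    appears in the top-k most similar videos under cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = t @ v.T                      # (num_texts, num_videos)
    # sort videos by descending similarity for each query
    ranks = (-sim).argsort(axis=1)
    # hit if the matching index appears within the first k columns
    hits = (ranks[:, :k] == np.arange(len(t))[:, None]).any(axis=1)
    return hits.mean()

# toy check: identical embeddings retrieve themselves at rank 1
emb = np.eye(4)
print(recall_at_k(emb, emb, k=1))  # 1.0
```

The embedding model itself is out of scope here; any encoder that maps captions and clips into a shared vector space can be plugged in.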
Evaluation Dataset
MSR-VTT
Data description:
MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset containing videos and corresponding text annotations. It consists of 10,000 video clips drawn from 20 categories, and each clip is annotated with 20 English sentences.
Dataset structure:
Amount of source data:
The dataset is split into train (6,513 videos), validation (497), and test (2,990); each video has 20 captions.
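A quick sanity check that the split sizes stated above cover the full dataset, and the total number of text queries they imply:

```python
# MSR-VTT split sizes as listed above
splits = {"train": 6513, "validation": 497, "test": 2990}

total_videos = sum(splits.values())
total_captions = total_videos * 20  # 20 captions per clip

print(total_videos)    # 10000
print(total_captions)  # 200000
```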
Data detail:
| KEYS | EXPLANATION |
|---|---|
| vid | the video clip |
| texts | the captions of the video |
Sample of source dataset:
vid:
texts:
- a baker is demonstrating a cooking technique
- a female giving a baking demonstration in her kitchen
- a girl explaining to prepare a dish
- a lady with a scarf is cooking with dough
- a person is preparing some food
- a person making pastries
- a woman is making a pastry
- a woman is rolling doe
- a woman is rolling dough around a stick
- a woman is rolling dough
- a woman is rolling dough
- a woman is wrapping dough around some food item
- a woman rolling up pastry while giving instructions
- a woman rolls dough
- a woman showing an easy way to make crescent rolls
- how to prepare food rolls
- the pastry should have five creases
- a person is preparing some food
- a woman is rolling dough around a stick
- a woman rolls dough
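A record like the sample above can be consumed as a `vid`/`texts` pair. The snippet below is a sketch only; the record (including the id `video7010`) is hypothetical, and the on-disk format of the actual release may differ:

```python
import json

# Hypothetical JSON record mirroring the keys documented above
record = json.loads("""
{"vid": "video7010",
 "texts": ["a woman is rolling dough",
           "a person is preparing some food"]}
""")

# each caption is an independent text query against the same video
assert set(record) == {"vid", "texts"}
print(record["vid"], "-", len(record["texts"]), "captions")
```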
Citation information:
@inproceedings{xu2016msr-vtt,
  author    = {Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
  title     = {MSR-VTT: A Large Video Description Dataset for Bridging Video and Language},
  booktitle = {IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2016},
  month     = {June},
}
UCF-101
Data description:
UCF-101 is an action-recognition video dataset collected from YouTube by the University of Central Florida, covering 101 action categories with a total of 13,320 videos.
Dataset structure:
Amount of source data:
The dataset is split into train (9,537 videos) and test (3,783).
Data detail:
| KEYS | EXPLANATION |
|---|---|
| vid | the video clip |
| label | the action label of the video |
Sample of source dataset:
vid:
label:
Playing Basketball
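Since UCF-101 records carry a string `label` rather than captions, evaluation code usually maps label names to integer class ids first. A minimal sketch with illustrative record names (the ids and `clip_*` names are not from the release):

```python
# Hypothetical records mirroring the vid/label keys documented above
records = [
    {"vid": "clip_0001", "label": "Playing Basketball"},
    {"vid": "clip_0002", "label": "Apply Eye Makeup"},
]

# deterministic mapping: sort label names alphabetically, then enumerate
classes = sorted({r["label"] for r in records})
label_to_id = {name: i for i, name in enumerate(classes)}

ids = [label_to_id[r["label"]] for r in records]
print(ids)  # [1, 0] -- "Apply Eye Makeup" sorts first and gets id 0
```

Sorting before enumerating keeps the mapping stable across runs, which matters when predictions and labels are produced by separate scripts.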
Citation information:
@article{soomro2012ucf101,
title={UCF101: A dataset of 101 human actions classes from videos in the wild},
author={Soomro, Khurram and Zamir, Amir Roshan and Shah, Mubarak},
journal={arXiv preprint arXiv:1212.0402},
year={2012}
}